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Abstract 

In experiments that are aimed at detecting astrophysical sources such as neutrino 
telescopes, one usually performs a search over a continuous parameter space (e.g. the 
angular coordinates of the sky, and possibly time), looking for the most significant 
deviation from the background hypothesis. Such a procedure inherently involves a 
"look elsewhere effect", namely, the possibility for a signal-like fluctuation to appear 
anywhere within the search range. Correctly estimating the p-value of a given ob- 
servation thus requires repeated simulations of the entire search, a procedure that 
may be prohibitively expansive in terms of CPU resources. Recent results from the 
theory of random fields provide powerful tools which may be used to alleviate this 
difficulty, in a wide range of applications. We review those results and discuss their 
implementation, with a detailed example applied for neutrino point source analysis 
in the IceCube experiment. 
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1. Introduction 

The statistical significance associated with the detection of a signal source is most 
often reported in the form of a p- value, that is, the probability under the background- 
only hypothesis of observing a phenomenon as or even more 'signal-like' than the one 
observed by the experiment. In many simple situations, a p-value can be calculated 
using asymptotic results such as those given by Wilk's theorem [IJ, without the need 
of generating a large number of pseudo-experiments. This is not the case however 
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when the procedure for detecting the source involves a search over some range, for ex- 
ample, when one is trying to observe a hypothetic signal from an astrophysical source 
that can be located at any direction in the sky. Wilk's theorem does not apply in this 
situation since the signal model contains parameters (i.e. the signal location) which 
are not present under the null hypothesis. Estimation of the p-value could be then 
performed by repeated Monte Carlo simulations of the experiment's outcome under 
the background-only hypothesis, but this approach could be highly time consuming 
since for each of those simulations the entire search procedure needs to be applied to 
the data, and to establish a discovery claim at the 5a level (p-value=2.87 x 10"^) the 
simulation needs to be repeated at least (9(10^) times. Fortunately, recent advances 
in the theory of random fields provide analytical tools that can be used to address 
exactly such problems, in a wide range of experimental settings. Such methods could 
be highly valuable for experiments searching for signals over large parameter spaces, 
as the reduction in necessary computation time can be dramatic. Random field the- 
oretic methods were first applied to the statistical hypothesis testing problem in [2j, 
for some special case of a one dimensional problem. A practical implementation of 
this result, aimed at the high-energy physics community, was made in Similar 
results for some cases of multi-dimensional problems |lj|5] were applied to statistical 
tests in the context of brain imaging [6]. More recently, a generalized result dealing 
with random fields over arbitrary Riemannian manifolds was obtained [7j , openning 
the door for a plethora of new possible applications. Here we discuss the implemen- 
tation of these results in the context of the search for astrophysical sources, taking 
IceCube [8J as a specific example. In section |2] the general framework of an hypoth- 
esis test is briefly presented with connection to random fields. In section |3] the main 
theoretical result is presented, and an example is treated in detail in section |4} 

2. Formalism of a search as a statistical test 

The signal search procedure can be formulated as a hypothesis testing problem 
in the following way. The null (background-only) hypothesis Hq : fi = 0, is tested 
against a signal hypothesis Hi : ^ > 0, where /i represents the signal strength. 
Suppose that 6 are some nuisance parameters describing other properties of the 
signal (such as location), which are therefore not present under the null. Additional 
nuisance parameters, denoted by 6', may be present under both hypotheses. Denote 
by ^{n, 6, 9') the likelihood function. One may then construct the profile likelihood 
ratio test statistic [9] 
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max^(/i = 0,^') 
^ ^ max ^(u, 6*, 6*') ^ 

and reject the null hypothesis if the test statistic is larger then some critical value. 
Note that when the signal strength is set to zero the likelihood by definition does 
not depend on 9, and the test statistic ([T]) can therefore be written as 

(l = ^^^li^) (2) 

where q{6) is the profile likelihood ratio with the signal nuisance parameters fixed 
to the point 6, and we have explicitely denoted by ^ the D-dimensional manifold 
to which the parameters 6 belong. Under the conditions of Wilks' theorem [T], for 
any fixed point 6, q{6) follows a distribution with one degree of freedom when 
the null hypothesis is true. When viewed as a function over the manifold q{6) is 
therefore a random field, namely a set of random variables that are continuously 
mapped to the manifold To quantify the significance of a given observation in 
terms of a p-value, one is required to calculate the probability of the maximum of 
the field to be above some level, that is, the excursion probability of the field: 

|)- value = P[maxg(^) > u]. (3) 



Estimation of excursion probabilities has been extensively studied in the frame- 
work of random fields. Despite the seemingly difficult nature of the problem, some 
surprisingly simple closed-form expressions have been derived under general condi- 
tions, which allow to estimate the excursion probability ^ when the level u is large. 
Such 'high' excursions are of course the main subject of interest, since one is inter- 
ested in estimating the p-value for apparently significant (signal-like) fiuctuations. 
We shall briefiy describe the main theoretical results in the following section. For a 
comprehensive and precise definitions, the reader is referred to Ref. [7]. 



3. The excursion sets of random fields 

The excursion set of a field above a level u, denoted by A^, is defined as the set 
of points 6 for which the value of the field q{6) is larger than u, 

Au = {ee^ : q{e) > u} (4) 

and we will denote by (j){Au) the Euler characteristic of the excursion set A^. For 
a 2-dimensional field, the Euler characteristic can be regarded as the number of 
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disconnected components minus the number of 'holes', as is illustrated in FigjTJ A 
fundamental result of [7] states that the expectation of the Euler characteristic (^{Au) 
is given by the following expression: 

D 

E[<l){A^)]=J2^dPd{u). (5) 

d=0 

The coefficients yi^a are related to some geometrical properties of the manifold 
and the covariance structure of the field. For the purpose of the present analysis 
however they can be regarded simply as a set of unknown constants. The functions 
Pd{u) are 'universal' in the sense that they are determined only by the distribution 
type of the field q{0), and their analytic expressions are known for a large class of 
'Gaussian related' fields, such as with arbitrary degrees of freedom. The zeroth 
order term of eq. ([s]) is a special case for which ,yfo and po{u) are generally given by 

^o = 0(^), Po{u)=F[q{e)>u] (6) 

Namely, ^ is the Euler characteristic of the entire manifold and po{u) is the tail 
probability of the distribution of the field. (Note that when the manifold is reduced 
to a point, this result becomes trivial). 




Figure 1: Illustration of the Euler characteristic of some 2-dimensional bodies. 

When the level u is high enough, excursions above u become rare and the excur- 
sion set becomes a few disconnected hyper-ellipses. In that case the Euler character- 
istic (^{Au) simply counts the number of disconnected components that make up A^. 
For even higher levels this number is mostly zero and rarely one, and Its expectation 
therefore converges asymptotically to the excursion probability. We can thus use it 
as an approximation to the excursion probability for large enough u [10] 
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E[(j){Au)] ^ P[maxg(^) > u]. (7) 

The practical importance of Eq. (|5| now becomes clear, as it allows to estimate 
the excursion probabilities above high levels. Furthermore, the problem is reduced 
to finding the constants d > 0. Since Eq. ^ holds for any level u, this could be 
achieved simply by calculating the average of at some low levels, which can be 

done using a small set of Monte Carlo simulations. We shall now turn to a specific 
example where this procedure is demonstrated. 

4. Application to neutrino source detection 

The IceCube experiment |S] is a neutrino telescope located at the south pole and 
aimed at detecting astrophysical neutrino sources. The detector measures the energy 
and angular direction of incoming neutrinos, trying to distinguish an astrophysical 
point-like signal from a large background of atmospheric neutrinos spread across 
the sky. The nuisance parameters over which the search is performed are there- 
fore the angular coordinates (^^, We follow [llj for the definitions of the signal 
and background distributions and the likelihood function. The signal is assumed to 
be spatially Gaussian distributed with a width corresponding to the instrumental 
resolution of 0.7°, and the background from atmospheric neutrinos is assumed to be 
uniform in azimuthal angle. We use a background simulation sample of 67000 events, 
representing roughly a year of data, provided to us by the authors of [llj. We then 
calculate a profile likelihood ratio as described in the previous section. Figure [2] 
shows a "significance map" of the sky, namely the values of the test statistic q{6, if) 
as well as the corresponding excursion set above q = 1. To reduce computation 
time we restrict here the search space to the portion of the sky at declination angle 
27° below the zenith, however all the geometrical features of a full sky search are 
maintained. Note that the most significance point has a value of the test statistic 
above 16, which would correspond to a significance exceeding Aa if this point would 
have been analyzed alone, that is without the "look elsewhere" effect. 

4.I. Computation of the Euler characteristic 

In practice, the test statistic q{9^ ip) is calculated on a grid or points, or 'pixels', 
which are sufficiently smaller than the detector resolution. The computation of the 
Euler characteristic can then be done in a straightforward way, using Euler's formula: 



^The signal model may include additional parameters such as spectral index and time, which we 
do not consider here for simplicity. 
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(a) (b) 

Figure 2: (a) A significance map showing a projection of the test statistic q{9,ip) for a background 
simulation (b) The corresponding excursion set above q = 1. The Euler characteristic in this 
example is 95. 



(P = V-E + F (8) 

where V, E, and F are respectively the numbers of vertices (pixels), edges and 
faces making up the excursion set. An edge is a line connecting two adjacent pixels 
and a face is the square made by connecting the edges of four adjacent pixels. An 
Illustration is given Fig|3l (Alth ough it is most convenient to use a simple square 
grid, other grid types can be used if necessary, in which case the faces would be of 
other polygonal shapes). 

Once the Euler characteristic is calculated, the coefficients of Eq. ([s]) can be 
readily estimated. For a random field with one degree of freedom and for two 
search dimensions, the explicit form of Eq. ([s]) is given by [7] : 

E[4>{A^)] = F[x' >u]+ e-"/^(^ + v^^2). (9) 

To estimate the unknown coefficients we use a set of 20 background sim- 

ulations, and calculate the average Euler characteristic of the excursion set corre- 
sponding to the levels u = 0,1 (The number of required simulations depends on the 
desired accuracy level of the approximation. For most practical purposes, estimating 
the p- value with a relative uncertainty of about 10% should be satisfactory.). This 
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Figure 3: Illustration of the computation of the Euler characteristic using formula ([8|. Each square 
represents a pixel. Here, the number of vertices is 18, the number of edges is 23 and the number of 
faces is 7, giving (/> = 18 — 23 + 7 = 2. 

gives the estimates E[0(Ao)] = 33.5 ±2 and E[0(Ai)] = 94.6 ± 1.3. By solving for the 
unknown coefficients we obtain ,yVi = 33 ± 2 and ^ = 123 ± 3. The prediction of 
Eq. ^ is then compared against a set of approx. 200,000 background simulations, 
where for each one the maximum of q{9, (p) is found by scanning the entire grid. 
The results are shown in Figure |4} As expected, the approximation becomes better 
as the p-value becomes smaller. The agreement between Eq. ^ and the observed 
p-value is maintained up to the smallest p-value that the available statistics allows 
us to estimate. 

4-2. Slicing the parameter space 

A useful property of Eq. ([s]) that can be illustrated by this example, is the ability 
to consider only a small 'slice' of the parameter space from which the expected Euler 
characteristic (and hence p- value) of the entire space can be estimated, if a symmetry 
is present in the problem. This can be done using the 'inclusion-exclusion' property 
of the Euler characteristic: 
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Figure 4: The prediction of Eq. ([9| (dashed red) against the observed p-value (solid blue) from a 
set of 200,000 background simulations. The yellow band represents the statistical uncertainty due 
to the available number of background simulations. 



(j){AUB) = (j){A) + (j){B)-(f){AnB). (10) 

Since the neutrino background distribution is assumed to be uniform in azimuthal 
angle [ip], we can divide the sky to identical slices of azimuthal angle, as illustrated 



in Figure^ Applying (10) to this case, the expected Euler characteristic is given by 



E[0(A„)] = X (E[0(s/zce)] - K[(j){edge)]) + E[(f){Q)] (11) 

where an 'edge' is the line common to two adjacent slices, and 0(0) is the Euler 
characteristic of the point at the origin (see Figure [s]). 

We can now apply Eq. ^ to both cjy^slice) and (l){edge) and estimate the correspond- 
ing coefficients as was done before, using only simulations of a single slice of the sky. 
Following this procedure we obtain for this example with A^ = 18 slices from 40 
background simulations, ^"'^^^ = 6 ± 0.5, J^/''"' = 6.7 ± 0.8 and ^/"'^^ = 4.4 ± 0.2. 



Using (11 ) this leads to the full sky coefficients ^ = 28 ± 9 and ^ = 120 ± 14, a 
result which is consistent with the full sky simulation procedure. This demonstrates 
that the p-value can be accurately estimated by only simulating a small portion of 
the search space. 
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edge 









Figure 5: Illustration of the excursion set in a slice of a sky, showing also an edge (solid blue) and 
the origin as defined in Eq. (11). In this example (f){slice) = 6, (f>{edge) — 2 and (t>{0) = 0. 



5. Summary 

The Euler characteristic formula, a fundamental result from the theory of random 
fields, provides a practical mean of estimating a p-value while taking into account 
the "look elsewhere effect". This result might be particularly useful for experiments 
that involve a search for signal over a large parameter space, such as high energy 
neutrino telescopes. While the example considered here deals with a search in a 
2-dimensional space, the formalism is general and could be in principle applied to 
any number of search dimensions. For example, if one is trying to detect a 'burst' 
event then time would constitute an additional search dimension. In such case the 
method of slicing could be useful as well, as one will not have to simulate the entire 
operating period of the detector but only a small 'slice' of time (provided that the 
background does not vary in time). Thus, the computational burden of having to 
perform a very large number of Monte Carlo simulations in order to to estimate a 
p-value, could be greatly reduced. 
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